11-cs01-data

Professor Shannon Ellis

2/14/23

CS01: Right-To-Carry (Data)

Q&A

Q: Is there ever a time when we should use \(R^2\) instead of adjusted \(R^2\) when analyzing a model?
A: When talking about the variance explained by a single model, \(R^2\) is great, but when comparing across models, you’ll always want to use adjusted \(R^2\).
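A quick sketch of the difference on the built-in mtcars data (a stand-in, not our case-study data):

```r
# R^2 never decreases as predictors are added, so it can't fairly
# compare models with different numbers of predictors
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

summary(m1)$r.squared      # variance explained by a single model
summary(m2)$r.squared      # >= m1's R^2, by construction

# adjusted R^2 penalizes each added predictor, so it's the one to
# compare across models
summary(m1)$adj.r.squared
summary(m2)$adj.r.squared
```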

Q: How do we use the data from OCS for our case study? Should we merge the data files?
A: Excellent question! That’s what today’s lecture is all about. There’s a whole lot of wrangling to do before we can use these data!

Q: Do we need to look at the p-value when we do analysis? (Midterm01)
A: When interpreting a model, no. When doing hypothesis testing (we’ll get there), it is one piece you can look at. A few people did interpret p-values on the midterm, and that’s ok! (But it was not required.)
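For reference, here is where p-values show up in R’s model output (mtcars again as a stand-in); interpreting the coefficients themselves doesn’t require them:

```r
m <- lm(mpg ~ wt, data = mtcars)

# the Pr(>|t|) column of the coefficient table holds the p-values
summary(m)$coefficients
```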

Q: Can we have a system where we can find other students to group with? Like a google form?
A: Great question! I’ll start a pinned thread on Campuswire so you all can find one another.

Q: I think the data seems pretty confusing.
A: That’s b/c it is! We’ve got a lot of work to do to get it into a usable/understandable format.

Q: In what contexts should we use an interaction model versus a main effects model?
A: Interaction terms should be included when the relationship between one predictor and the outcome varies by another predictor.
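In R’s formula syntax the two look like this (sketched on mtcars, not our case-study data):

```r
# main effects only: one slope for wt, shared across transmission types
main_fit <- lm(mpg ~ wt + factor(am), data = mtcars)

# interaction: wt's slope is allowed to differ by transmission type
int_fit <- lm(mpg ~ wt * factor(am), data = mtcars)

# the wt:factor(am)1 coefficient estimates the difference in slopes
coef(int_fit)
```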

Course Announcements

Due Dates:

  • Lab 05 due tomorrow (2/17; 11:59 PM)
  • mid-course survey (optional for EC) due tomorrow (2/17; 11:59 PM)
  • Lecture Participation survey “due” after class

Notes:

  • CS01
    • instructions posted on website
    • invited to GH repo (accept invitation, please!)
    • have an email with other group mates
    • goal: meet more people in the class & work together
  • HW03 posted
  • Reminder to think about final project group mates; thread on campuswire

Agenda

  • Background
  • Data Intro
  • Wrangle
  • Combine!

Background

Right To Carry Laws

Right to Carry (RTC) Laws - “a law that specifies if and how citizens are allowed to have a firearm on their person or nearby (for example, in a citizen’s car) in public.”

RTC Laws (cont’d)

  • The Second Amendment to the United States Constitution guarantees the right to “keep and bear arms”. The amendment was ratified in 1791 as part of the Bill of Rights.
  • There are no federal laws about carrying firearms in public.
  • These laws are created and enforced at the US state level. States vary greatly in their laws about the right to carry firearms.
  • Some require extensive effort to obtain a permit to legally carry a firearm, while other states require very minimal effort to do so. An increasing number of states do not require permits at all.

RTC Laws Across the US


Questions

  1. What is the relationship between right to carry laws and violence rates in the US?
  2. What is the effect of multicollinearity on coefficient estimates from linear regression models when analyzing right to carry laws and violence rates?

The Data

The Data: Source

Two contradictory analyses:

  1. John J. Donohue et al., Right‐to‐Carry Laws and Violent Crime: A Comprehensive Assessment Using Panel Data and a State‐Level Synthetic Control Analysis. Journal of Empirical Legal Studies 16(2) (2019).

  2. David B. Mustard & John Lott. Crime, Deterrence, and Right-to-Carry Concealed Handguns. Coase-Sandor Institute for Law & Economics Working Paper No. 41, (1996).

The Data

Two Analyses

Limitations

  1. The analyses differed in the variables used; we will not be recreating either analysis in full.
  2. We’ll account for either the adoption or lack of adoption of a permissive right-to-carry law in each state; we will not account for differences in the level of permissiveness of the laws.
  3. Race is included here (as it was in initial analysis); however, any association between demographic variables (indicating the proportion of the population from specific race and age groups) and violent crime does not necessarily indicate that the two are linked causally.

Packages

library(OCSdata) # will need to be installed
library(tidyverse)
library(pdftools)
library(readxl)

Raw Data

There are a whole bunch of different data files we’ll be using…

# only get the data once
OCSdata::load_raw_data("ocs-bp-RTC-wrangling", outpath = '.')
  • creates a “data” sub-directory in your current working directory (if it does not already exist)
  • creates a “raw” sub-directory within “data”; contains the directories with the data
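Once the download finishes, a quick sanity check that the expected folders are in place (directory names assume the default structure described above):

```r
# confirm the data/raw sub-directory was created
dir.exists("data/raw")

# list the data directories it contains
list.files("data/raw")
```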

👉 Your Turn: Load the data into RStudio. It will take a while…so just let it get started.

The Goal

Get two datasets (Lott, Donohue) that contain demographic, population, police staffing, unemployment, violent crime, RTC, and poverty information at the state level across time.

❓ What would be the tidy way to store these data?
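One possible answer, sketched as a toy tibble: a long table with one row per state-year-variable measurement (the Alabama and Alaska values here are copied from outputs later in these notes):

```r
library(tibble)

# each measurement gets its own row, identified by state and year
tidy_sketch <- tribble(
  ~STATE,    ~YEAR, ~VARIABLE,           ~VALUE,
  "Alabama",  1977, "Unemployment_rate",     7.3,
  "Alabama",  1977, "Population",        3782571,
  "Alaska",   1977, "Population",         397220
)
```

Storing everything in this shape makes it easy to bind_rows() the many sources together and pivot wider only when it’s time to model.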

Your Turn

🧠 Take a look in one of the data folders, open at least one of the data files to view it, and try to get a sense of the type of information contained within it.

Demographic & Population Data

dem_77_79 <- read_csv("data/raw/Demographics/Decade_1970/pe-19.csv", skip = 5)

dem_80_89 <- list.files(recursive = TRUE,
                  path = "data/raw/Demographics/Decade_1980",
                  pattern = "*.csv",
                  full.names = TRUE) |> 
  purrr::map(~read_csv(., skip=5))

dem_90_99 <- list.files(recursive = TRUE,
                  path = "data/raw/Demographics/Decade_1990",
                  pattern = "*.txt",
                  full.names = TRUE) |> 
  map(~read_table2(., skip = 14))

dem_00_10 <- list.files(recursive = TRUE,
                  path = "data/raw/Demographics/Decade_2000",
                  pattern = "*.csv",
                   full.names = TRUE) |> 
   map(~read_csv(.))

Source: US Census Bureau Data

glimpse(dem_00_10[[1]])
Rows: 62,244
Columns: 21
$ REGION            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ DIVISION          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ STATE             <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
$ NAME              <chr> "United States", "United States", "United States", "…
$ SEX               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ORIGIN            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ RACE              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ AGEGRP            <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ ESTIMATESBASE2000 <dbl> 281424600, 19176154, 20549855, 20528425, 20218782, 1…
$ POPESTIMATE2000   <dbl> 282162411, 19178293, 20463852, 20637696, 20294955, 1…
$ POPESTIMATE2001   <dbl> 284968955, 19298217, 20173362, 20978678, 20456284, 1…
$ POPESTIMATE2002   <dbl> 287625193, 19429192, 19872417, 21261421, 20610370, 2…
$ POPESTIMATE2003   <dbl> 290107933, 19592446, 19620851, 21415353, 20797166, 2…
$ POPESTIMATE2004   <dbl> 292805298, 19785885, 19454237, 21411680, 21102552, 2…
$ POPESTIMATE2005   <dbl> 295516599, 19917400, 19389067, 21212579, 21486214, 2…
$ POPESTIMATE2006   <dbl> 298379912, 19938883, 19544688, 21033138, 21807709, 2…
$ POPESTIMATE2007   <dbl> 301231207, 20125962, 19714611, 20841042, 22067816, 2…
$ POPESTIMATE2008   <dbl> 304093966, 20271127, 19929602, 20706655, 22210880, 2…
$ POPESTIMATE2009   <dbl> 306771529, 20244518, 20182499, 20660564, 22192810, 2…
$ CENSUS2010POP     <dbl> 308745538, 20201362, 20348657, 20677194, 22040343, 2…
$ POPESTIMATE2010   <dbl> 309349689, 20200529, 20382409, 20694011, 21959087, 2…

State FIPS Codes

STATE_FIPS <- readxl::read_xls("data/raw/State_FIPS_codes/state-geocodes-v2014.xls", skip = 5)
glimpse(STATE_FIPS)
Rows: 64
Columns: 4
$ Region          <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",…
$ Division        <chr> "0", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2",…
$ `State\n(FIPS)` <chr> "00", "00", "09", "23", "25", "33", "44", "50", "00", …
$ Name            <chr> "Northeast Region", "New England Division", "Connectic…
STATE_FIPS <- STATE_FIPS |>
  rename(STATEFP = `State\n(FIPS)`,
         STATE = Name) |>
  select(STATEFP, STATE) |>
  filter(STATEFP != "00")

STATE_FIPS
# A tibble: 51 × 2
   STATEFP STATE        
   <chr>   <chr>        
 1 09      Connecticut  
 2 23      Maine        
 3 25      Massachusetts
 4 33      New Hampshire
 5 44      Rhode Island 
 6 50      Vermont      
 7 34      New Jersey   
 8 36      New York     
 9 42      Pennsylvania 
10 17      Illinois     
# … with 41 more rows

Police Staffing Data

There’s an issue currently with the ps_data file from OCS, so we’ll use this file instead:

ps_data <- read_csv("https://raw.githubusercontent.com/COGS137/datasets/main/pe_1960_2018.csv")
glimpse(ps_data)
Rows: 2,242
Columns: 3
$ data_year           <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 19…
$ state_abbr          <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO", "CT", "C…
$ officer_state_total <dbl> 544, 7380, 3344, 0, 6414, 65596, 7337, 6051, 0, 47…

Unemployment Data


ue_rate_data <- list.files(recursive = TRUE,
                           path = "data/raw/Unemployment",
                           pattern = "*.xlsx",
                           full.names = TRUE) |> 
  map(~read_xlsx(., skip = 10))

ue_rate_names <- list.files(recursive = TRUE,
                            path = "data/raw/Unemployment",
                            pattern = "*.xlsx",
                            full.names = TRUE) |>
  map(~read_xlsx(., range = "B4:B6")) |>
  map(c(1,2)) |>
  unlist()

names(ue_rate_data) <- ue_rate_names
head(ue_rate_data)[1]
$Alabama
# A tibble: 44 × 14
    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  1977   7.5   9     7.7   7.2   6.8   8.6   8     7.8   6.7   6.3   6.3   6  
 2  1978   7.1   6.9   6.2   5.4   5.1   6.9   6.7   6.7   6.5   6.3   6.3   6.5
 3  1979   6.7   7.5   6.9   6.6   6.4   8.4   7.7   7.8   7.1   7.2   6.9   6.7
 4  1980   7.7   7.8   7.4   7.4   8.4   9.7  10.4  10.3   9.3   9.6   9.4   9  
 5  1981  10    10.3   9.5   9.1   9.4  11.1  10.4  10.9  10.8  11.7  11.5  11.8
 6  1982  13.2  13.2  12.9  12.6  12.8  14.5  14.7  14.8  14.7  15.1  15.4  15.3
 7  1983  16    16    14.5  13.7  13.3  14.6  13.9  13.8  13.2  12.8  12.1  11.8
 8  1984  12.5  12.4  11.4  10.8  10.1  11.3  11.5  11.3  10.8  10.2   9.7  10.1
 9  1985  10.7  10.5   9.8   8.7   8.4   9.6   9.2   8.8   8.6   8.6   8.4   8.7
10  1986   9.3  10.4  10.1   9.4   9.4  10.5   9.7   9.6   9.7   9.7   9.6   9  
# … with 34 more rows, and 1 more variable: Annual <dbl>
ue_rate_data <- ue_rate_data |>
  map_df(bind_rows, .id = "STATE") |>
  select(STATE, Year, Annual) |>
  rename("YEAR" = Year,
         "VALUE" = Annual) |>
  mutate(VARIABLE = "Unemployment_rate")

ue_rate_data
# A tibble: 2,244 × 4
   STATE    YEAR VALUE VARIABLE         
   <chr>   <dbl> <dbl> <chr>            
 1 Alabama  1977   7.3 Unemployment_rate
 2 Alabama  1978   6.4 Unemployment_rate
 3 Alabama  1979   7.2 Unemployment_rate
 4 Alabama  1980   8.9 Unemployment_rate
 5 Alabama  1981  10.6 Unemployment_rate
 6 Alabama  1982  14.1 Unemployment_rate
 7 Alabama  1983  13.8 Unemployment_rate
 8 Alabama  1984  11   Unemployment_rate
 9 Alabama  1985   9.2 Unemployment_rate
10 Alabama  1986   9.7 Unemployment_rate
# … with 2,234 more rows

Poverty Data

poverty_rate_data <- read_xls("data/raw/Poverty/hstpov21.xls", skip=2)
head(poverty_rate_data)
# A tibble: 6 × 6
  `NOTE: Number in thousands.` ...2  ...3   ...4              ...5         ...6 
  <chr>                        <chr> <chr>  <chr>             <chr>        <chr>
1 2018                         <NA>  <NA>    <NA>             <NA>          <NA>
2 STATE                        Total Number "Standard\nerror" Percent      "Sta…
3 Alabama                      4877  779    "65"              16           "1.3"
4 Alaska                       720   94     "9"               13.1         "1.2"
5 Arizona                      7241  929    "80"              12.80000000… "1.1…
6 Arkansas                     2912  462    "38"              15.9         "1.3"

Source: US Census Bureau Data

Violent Crime Data

crime_data <- read_lines("data/raw/Crime/CrimeStatebyState.csv",
                         skip = 2, 
                         skip_empty_rows = TRUE)

Due to the spaces and “/” characters in the column names, read_lines() from the readr package works better here than read_csv().

head(crime_data)
[1] "Estimated crime in Alabama"                                                                                                         
[2] ",,National or state crime,,,,,,,"                                                                                                   
[3] ",,Violent crime,,,,,,,"                                                                                                             
[4] "Year,Population,Violent crime total,Murder and nonnegligent Manslaughter,Legacy rape /1,Revised rape /2,Robbery,Aggravated assault,"
[5] "1977,   3690000,      15293,         524,         929,,       3572,      10268 "                                                    
[6] "1978,   3742000,      15682,         499,         954,,       3708,      10521 "                                                    
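With the raw lines in hand, we can parse them ourselves; a minimal base-R sketch of that kind of line-based work, using sample lines in the same shape as the output above:

```r
lines <- c("Estimated crime in Alabama",
           ",,National or state crime,,,,,,,",
           "1977,   3690000,      15293,         524")

# flag the header line that starts each state's block
grepl("^Estimated crime in", lines)        # TRUE FALSE FALSE

# split a data line into comma-separated fields and trim the padding
trimws(strsplit(lines[3], ",")[[1]])       # "1977" "3690000" "15293" "524"
```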

Right-To-Carry Data

DAWpaper <- pdf_text("data/raw/w23510.pdf")
head(DAWpaper[1])
[1] "                              NBER WORKING PAPER SERIES\n\n\n\n                  RIGHT-TO-CARRY LAWS AND VIOLENT CRIME:\n             A COMPREHENSIVE ASSESSMENT USING PANEL DATA AND\n                 A STATE-LEVEL SYNTHETIC CONTROL ANALYSIS\n\n                                        John J. Donohue\n                                          Abhay Aneja\n                                         Kyle D. Weber\n\n                                      Working Paper 23510\n                              http://www.nber.org/papers/w23510\n\n\n                     NATIONAL BUREAU OF ECONOMIC RESEARCH\n                              1050 Massachusetts Avenue\n                                 Cambridge, MA 02138\n                           June 2017, Revised November 2018\n\n\n\nPreviously circulated as \"Right-to-Carry Laws and Violent Crime: A Comprehensive Assessment\nUsing Panel Data and a State-Level Synthetic Controls Analysis.\" We thank Dan Ho, Stefano\nDellaVigna, Rob Tibshirani, Trevor Hastie, StefanWager, Jeff Strnad, and participants at the\n2011 Conference of Empirical Legal Studies (CELS), 2012 American Law and Economics\nAssociation (ALEA) Annual Meeting, 2013 Canadian Law and Economics Association (CLEA)\nAnnual Meeting, 2015 NBER Summer Institute (Crime), and the Stanford Law School faculty\nworkshop for their comments and helpful suggestions. Financial support was provided by\nStanford Law School. We are indebted to Alberto Abadie, Alexis Diamond, and Jens\nHainmueller for their work developing the synthetic control algorithm and programming the Stata\nmodule used in this paper and for their helpful comments. 
The authors would also like to thank\nAlex Albright, Andrew Baker, Jacob Dorn, Bhargav Gopal, Crystal Huang, Mira Korb, Haksoo\nLee, Isaac Rabbani, Akshay Rao, Vikram Rao, Henrik Sachs and Sidharth Sah who provided\nexcellent research assistance, as well as Addis O’Connor and Alex Chekholko at the Research\nComputing division of Stanford’s Information Technology Services for their technical support.\nThe views expressed herein are those of the author and do not necessarily reflect the views of the\nNational Bureau of Economic Research.\n\nNBER working papers are circulated for discussion and comment purposes. They have not been\npeer-reviewed or been subject to the review by the NBER Board of Directors that accompanies\nofficial NBER publications.\n\n© 2017 by John J. Donohue, Abhay Aneja, and Kyle D. Weber. All rights reserved. Short\nsections of text, not to exceed two paragraphs, may be quoted without explicit permission\nprovided that full credit, including © notice, is given to the source.\n"

Save (Imported) Data

save(dem_77_79, dem_80_89, dem_90_99, dem_00_10, 
     STATE_FIPS, 
     ps_data, 
     ue_rate_data, 
     poverty_rate_data,
     crime_data,
     DAWpaper, file = "data/imported_data_rtc.rda")

Data Wrangling

Wrangle: Demo Data

dem_77_79 <- dem_77_79 |>
  rename("race_sex" =`Race/Sex Indicator`) |>
  mutate(SEX = str_extract(race_sex, "male|female"),
        RACE = str_extract(race_sex, "Black|White|Other"))|>
  select(-`FIPS State Code`, -`race_sex`) |>
  rename("YEAR" = `Year of Estimate`,
        "STATE" = `State Name`) |>
  filter(YEAR %in% 1977:1979)

dem_77_79 <- dem_77_79 |>
  pivot_longer(cols=contains("years"),
               names_to = "AGE_GROUP",
               values_to = "SUB_POP")

glimpse(dem_77_79)
Rows: 16,524
Columns: 6
$ YEAR      <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, …
$ STATE     <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ SEX       <chr> "male", "male", "male", "male", "male", "male", "male", "mal…
$ RACE      <chr> "White", "White", "White", "White", "White", "White", "White…
$ AGE_GROUP <chr> "Under 5 years", "5 to 9 years", "10 to 14 years", "15 to 19…
$ SUB_POP   <dbl> 98814, 113365, 123107, 135343, 126053, 111547, 100674, 81038…
dem_80_89 <- dem_80_89 |>
  map_df(bind_rows)

dem_80_89 <- dem_80_89 |>
  rename("race_sex" =`Race/Sex Indicator`) |>
  mutate(SEX = str_extract(race_sex, "male|female"),
        RACE = str_extract(race_sex, "Black|White|Other"))|>
  select( -`race_sex`) |>
  rename("YEAR" = `Year of Estimate`) |> 
  rename("STATEFP_temp" = "FIPS State and County Codes") |>
  mutate(STATEFP = str_sub(STATEFP_temp, start = 1, end = 2)) |>
    left_join(STATE_FIPS, by = "STATEFP") |>
  select(-STATEFP)

dem_80_89 <- dem_80_89 |>
  pivot_longer(cols=contains("years"),
               names_to = "AGE_GROUP",
               values_to = "SUB_POP_temp") |>
  group_by(YEAR, STATE, AGE_GROUP, SEX, RACE) |>
  summarize(SUB_POP = sum(SUB_POP_temp), .groups="drop")

dem_80_89
# A tibble: 55,080 × 6
    YEAR STATE   AGE_GROUP      SEX    RACE  SUB_POP
   <dbl> <chr>   <chr>          <chr>  <chr>   <dbl>
 1  1980 Alabama 10 to 14 years female Black   50108
 2  1980 Alabama 10 to 14 years female Other     805
 3  1980 Alabama 10 to 14 years female White  109066
 4  1980 Alabama 10 to 14 years male   Black   50768
 5  1980 Alabama 10 to 14 years male   Other     826
 6  1980 Alabama 10 to 14 years male   White  115988
 7  1980 Alabama 15 to 19 years female Black   58428
 8  1980 Alabama 15 to 19 years female Other     743
 9  1980 Alabama 15 to 19 years female White  126783
10  1980 Alabama 15 to 19 years male   Black   56808
# … with 55,070 more rows
dem_90_99 <- dem_90_99 |>
  map_df(bind_rows)

colnames(dem_90_99) <- c("YEAR", "STATEFP", "Age", "NH_W_M", "NH_W_F", "NH_B_M",
                         "NH_B_F", "NH_AIAN_M", "NH_AIAN_F", "NH_API_M", "NH_API_F",
                         "H_W_M", "H_W_F", "H_B_M", "H_B_F", "H_AIAN_M", "H_AIAN_F",
                         "H_API_M", "H_API_F")

dem_90_99 <- dem_90_99 |>
  drop_na() |>
  mutate(W_M = NH_W_M + H_W_M, W_F = NH_W_F + H_W_F,
         B_M = NH_B_M + H_B_M, B_F = NH_B_F + H_B_F,
         AIAN_M = NH_AIAN_M + H_AIAN_M, AIAN_F = NH_AIAN_F + H_AIAN_F,
         API_M = NH_API_M + H_API_M, API_F = NH_API_F + H_API_F) |>
  select(-starts_with("NH_"), -starts_with("H_"))

dem_90_99 <- dem_90_99 |>
  mutate(AGE_GROUP = cut(Age,
                         breaks = seq(0,90, by=5),
                         right = FALSE, labels = pull(distinct(dem_77_79,AGE_GROUP), AGE_GROUP))) |>
  select(-Age) |>
  pivot_longer(cols = c(starts_with("W_"),
                        starts_with("B_"),
                        starts_with("AIAN_"),
                        starts_with("API_")),
               names_to = "RACE",
               values_to = "SUB_POP_temp") |>
  mutate(SEX = case_when(str_detect(RACE, "_M") ~ "Male",
                         TRUE ~ "Female"),
         RACE = case_when(str_detect(RACE, "W_") ~ "White",
                          str_detect(RACE, "B_") ~ "Black",
                          TRUE ~ "Other"))

dem_90_99 <- dem_90_99 |>
  left_join(STATE_FIPS, by = "STATEFP") |>
  select(-STATEFP) |>
  group_by(YEAR, STATE, AGE_GROUP, SEX, RACE) |>
  summarize(SUB_POP = sum(SUB_POP_temp), .groups="drop")

glimpse(dem_90_99)
Rows: 55,080
Columns: 6
$ YEAR      <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, …
$ STATE     <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ AGE_GROUP <fct> Under 5 years, Under 5 years, Under 5 years, Under 5 years, …
$ SEX       <chr> "Female", "Female", "Female", "Male", "Male", "Male", "Femal…
$ RACE      <chr> "Black", "Other", "White", "Black", "Other", "White", "Black…
$ SUB_POP   <dbl> 45377, 1406, 92380, 46635, 1360, 98524, 46067, 1698, 92530, …
dem_00_10 <- dem_00_10 |>
  map_df(bind_rows)

dem_00_10 <- dem_00_10 |>
  select(-ESTIMATESBASE2000,-CENSUS2010POP) |>
  filter(NAME != "United States",
         SEX != 0,
         RACE != 0,
         AGEGRP != 0, 
         ORIGIN == 0) |>
  select(-REGION, -DIVISION, -ORIGIN, -STATE) |>
  rename("STATE" = NAME,
         "AGE_GROUP" = AGEGRP)

dem_00_10 <- dem_00_10 |>
  mutate(SEX = factor(SEX, levels = 1:2, labels = c("Male", "Female")),
         RACE = factor(RACE, levels = 1:6, labels = c("White", "Black", rep("Other",4))),
         AGE_GROUP = factor(AGE_GROUP, levels = 1:18,
                            labels = pull(distinct(dem_77_79,AGE_GROUP), AGE_GROUP)))

dem_00_10 <- dem_00_10 |>
  pivot_longer(cols=contains("ESTIMATE"), names_to = "YEAR", values_to = "SUB_POP_temp") |>
   mutate(YEAR = str_sub(YEAR, start=-4),
          YEAR = as.numeric(YEAR)) |> 
  group_by(YEAR, AGE_GROUP, STATE, SEX, RACE) |>
  summarize(SUB_POP = sum(SUB_POP_temp), .groups = "drop")
                            
glimpse(dem_00_10)
Rows: 60,588
Columns: 6
$ YEAR      <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
$ AGE_GROUP <fct> Under 5 years, Under 5 years, Under 5 years, Under 5 years, …
$ STATE     <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ SEX       <fct> Male, Male, Male, Female, Female, Female, Male, Male, Male, …
$ RACE      <fct> White, Black, Other, White, Black, Other, White, Black, Othe…
$ SUB_POP   <dbl> 99527, 46595, 4487, 94473, 45672, 4431, 14765, 1039, 8572, 1…

Wrangle: Population Data

pop_77_79 <- dem_77_79 |>
  group_by(YEAR, STATE) |>
  summarize(TOT_POP = sum(SUB_POP), .groups = "drop") 

pop_77_79 
# A tibble: 153 × 3
    YEAR STATE                 TOT_POP
   <dbl> <chr>                   <dbl>
 1  1977 Alabama               3782571
 2  1977 Alaska                 397220
 3  1977 Arizona               2427296
 4  1977 Arkansas              2207195
 5  1977 California           22350332
 6  1977 Colorado              2696179
 7  1977 Connecticut           3088745
 8  1977 Delaware               594815
 9  1977 District of Columbia   681766
10  1977 Florida               8888806
# … with 143 more rows
pop_80_89 <- dem_80_89 |>
  group_by(YEAR, STATE) |>
  summarize(TOT_POP = sum(SUB_POP), .groups = "drop") 
pop_90_99 <- dem_90_99 |>
  group_by(YEAR, STATE) |>
  summarize(TOT_POP = sum(SUB_POP), .groups = "drop")
pop_00_10 <- dem_00_10 |>
  group_by(YEAR, STATE) |>
  summarize(TOT_POP = sum(SUB_POP), .groups = "drop")

Combine: Demo + Population

dem_77_79 <- dem_77_79 |>
  left_join(pop_77_79, by = c("YEAR", "STATE")) |> 
  mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) |>
  select(-SUB_POP, -TOT_POP) |>
  mutate(SEX = str_to_title(SEX))


dem_77_79
# A tibble: 16,524 × 6
    YEAR STATE   SEX   RACE  AGE_GROUP      PERC_SUB_POP
   <dbl> <chr>   <chr> <chr> <chr>                 <dbl>
 1  1977 Alabama Male  White Under 5 years          2.61
 2  1977 Alabama Male  White 5 to 9 years           3.00
 3  1977 Alabama Male  White 10 to 14 years         3.25
 4  1977 Alabama Male  White 15 to 19 years         3.58
 5  1977 Alabama Male  White 20 to 24 years         3.33
 6  1977 Alabama Male  White 25 to 29 years         2.95
 7  1977 Alabama Male  White 30 to 34 years         2.66
 8  1977 Alabama Male  White 35 to 39 years         2.14
 9  1977 Alabama Male  White 40 to 44 years         1.98
10  1977 Alabama Male  White 45 to 49 years         2.02
# … with 16,514 more rows
dem_80_89 <- dem_80_89 |>
  left_join(pop_80_89, by = c("YEAR", "STATE")) |>
  mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) |>
  select(-SUB_POP, -TOT_POP) |>
  mutate(SEX = str_to_title(SEX))
dem_90_99 <- dem_90_99 |>
  left_join(pop_90_99, by = c("YEAR", "STATE")) |>
  mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) |>
  select(-SUB_POP, -TOT_POP)

dem_90_99
# A tibble: 55,080 × 6
    YEAR STATE   AGE_GROUP     SEX    RACE  PERC_SUB_POP
   <dbl> <chr>   <fct>         <chr>  <chr>        <dbl>
 1  1990 Alabama Under 5 years Female Black       1.12  
 2  1990 Alabama Under 5 years Female Other       0.0347
 3  1990 Alabama Under 5 years Female White       2.28  
 4  1990 Alabama Under 5 years Male   Black       1.15  
 5  1990 Alabama Under 5 years Male   Other       0.0336
 6  1990 Alabama Under 5 years Male   White       2.43  
 7  1990 Alabama 5 to 9 years  Female Black       1.14  
 8  1990 Alabama 5 to 9 years  Female Other       0.0419
 9  1990 Alabama 5 to 9 years  Female White       2.29  
10  1990 Alabama 5 to 9 years  Male   Black       1.16  
# … with 55,070 more rows
dem_00_10 <- dem_00_10 |>
  left_join(pop_00_10, by = c("YEAR", "STATE")) |>
  mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) |>
  select(-SUB_POP, -TOT_POP)

dem_00_10
# A tibble: 60,588 × 6
    YEAR AGE_GROUP     STATE   SEX    RACE  PERC_SUB_POP
   <dbl> <fct>         <chr>   <fct>  <fct>        <dbl>
 1  2000 Under 5 years Alabama Male   White       2.24  
 2  2000 Under 5 years Alabama Male   Black       1.05  
 3  2000 Under 5 years Alabama Male   Other       0.101 
 4  2000 Under 5 years Alabama Female White       2.12  
 5  2000 Under 5 years Alabama Female Black       1.03  
 6  2000 Under 5 years Alabama Female Other       0.0995
 7  2000 Under 5 years Alaska  Male   White       2.35  
 8  2000 Under 5 years Alaska  Male   Black       0.165 
 9  2000 Under 5 years Alaska  Male   Other       1.37  
10  2000 Under 5 years Alaska  Female White       2.26  
# … with 60,578 more rows

❗ This would be a good place in the code to write a user-defined function…

Aside: UDF

  • User-defined functions (UDFs) are functions you write to make your code cleaner
  • Any time you copy+paste very similar code, think to yourself…I should make this a function!

The general syntax for a function in R is:

function_name <- function(parameters){
  # code to carry out
  # using the parameters
}

Note: by default, the value of the last evaluated expression within the function is returned from the function
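A tiny illustration of that implicit-return behavior (a toy function, not part of the case study):

```r
perc_of_total <- function(x){
  total <- sum(x)
  (x / total) * 100  # last evaluated expression: returned automatically
}

perc_of_total(c(1, 3))  # 25 75
```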

combine_demo_pop <- function(df_dem, df_pop){
  # df_pop supplies TOT_POP for each YEAR-STATE combination (e.g., pop_00_10)
  df_dem |>
    left_join(df_pop, by = c("YEAR", "STATE")) |>
    mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) |>
    select(-SUB_POP, -TOT_POP) |>
    mutate(SEX = str_to_title(SEX))
}
combined_df <- combine_demo_pop(dem_00_10, pop_00_10)
  • Note: if you’ve already joined the population data into dem_00_10 (so SUB_POP has been dropped), this would not work.
  • Cleaning up/improving code in your case studies is encouraged!

Combine: Demo/Pop Across Decades

dem <- bind_rows(dem_77_79,
                 dem_80_89,
                 dem_90_99,
                 dem_00_10)

glimpse(dem)
Rows: 187,272
Columns: 6
$ YEAR         <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 197…
$ STATE        <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "A…
$ SEX          <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "…
$ RACE         <chr> "White", "White", "White", "White", "White", "White", "Wh…
$ AGE_GROUP    <chr> "Under 5 years", "5 to 9 years", "10 to 14 years", "15 to…
$ PERC_SUB_POP <dbl> 2.6123502, 2.9970356, 3.2545853, 3.5780690, 3.3324688, 2.…

Demographic Data (Donohue)

DONOHUE_AGE_GROUPS <- c("15 to 19 years",
                        "20 to 24 years",
                        "25 to 29 years",
                        "30 to 34 years",
                        "35 to 39 years")

dem_DONOHUE <- dem |>
  filter(AGE_GROUP %in% DONOHUE_AGE_GROUPS,
               SEX == "Male") |>
  mutate(AGE_GROUP = fct_collapse(AGE_GROUP, "20 to 39 years"=c("20 to 24 years",
                                                                "25 to 29 years",
                                                                "30 to 34 years",
                                                                "35 to 39 years")),
         AGE_GROUP = str_replace_all(string = AGE_GROUP, 
                                     pattern = " ", 
                                     replacement = "_")) |>
  group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) |>
  summarize(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop") |>
  unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") |>
  rename("VALUE" = PERC_SUB_POP)

dem_DONOHUE
# A tibble: 10,404 × 4
    YEAR STATE   VARIABLE                    VALUE
   <dbl> <chr>   <chr>                       <dbl>
 1  1977 Alabama Black_Male_15_to_19_years  1.55  
 2  1977 Alabama Black_Male_20_to_39_years  3.04  
 3  1977 Alabama Other_Male_15_to_19_years  0.0178
 4  1977 Alabama Other_Male_20_to_39_years  0.0642
 5  1977 Alabama White_Male_15_to_19_years  3.58  
 6  1977 Alabama White_Male_20_to_39_years 11.1   
 7  1977 Alaska  Black_Male_15_to_19_years  0.163 
 8  1977 Alaska  Black_Male_20_to_39_years  0.968 
 9  1977 Alaska  Other_Male_15_to_19_years  1.12  
10  1977 Alaska  Other_Male_20_to_39_years  2.73  
# … with 10,394 more rows

Demographic Data (Lott)

LOTT_AGE_GROUPS_NULL <- c("Under 5 years",
                          "5 to 9 years")

dem_LOTT <- dem |>
  filter(!(AGE_GROUP %in% LOTT_AGE_GROUPS_NULL) )|>
  mutate(AGE_GROUP = fct_collapse(AGE_GROUP,
                                  "10 to 19 years"=c("10 to 14 years", "15 to 19 years"),
                                  "20 to 29 years"=c("20 to 24 years", "25 to 29 years"),
                                  "30 to 39 years"=c("30 to 34 years", "35 to 39 years"),
                                  "40 to 49 years"=c("40 to 44 years", "45 to 49 years"),
                                  "50 to 64 years"=c("50 to 54 years", "55 to 59 years",
                                                     "60 to 64 years"),
                                  "65 years and over"=c("65 to 69 years", "70 to 74 years", 
                                                        "75 to 79 years", "80 to 84 years",
                                                        "85 years and over")),
         AGE_GROUP = str_replace_all(AGE_GROUP, " ", "_")) |>
  group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) |>
  summarize(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop") |>
  unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") |>
  rename("VALUE" = PERC_SUB_POP)

glimpse(dem_LOTT)
Rows: 62,424
Columns: 4
$ YEAR     <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1…
$ STATE    <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alaba…
$ VARIABLE <chr> "Black_Female_10_to_19_years", "Black_Female_20_to_29_years",…
$ VALUE    <dbl> 3.01067713, 2.32860137, 1.29295656, 1.18231753, 1.73263106, 1…

Combine: Population Data

population_data <- bind_rows(pop_77_79,
                             pop_80_89,
                             pop_90_99,
                             pop_00_10)

population_data <- population_data |>
  mutate(VARIABLE = "Population") |>
  rename("VALUE" = TOT_POP)

Wrangling: Police staffing

# the provided dataset has already had a bit of wrangling done for you
ps_data
# A tibble: 2,242 × 3
   data_year state_abbr officer_state_total
       <dbl> <chr>                    <dbl>
 1      1977 AK                         544
 2      1977 AL                        7380
 3      1977 AR                        3344
 4      1977 AS                           0
 5      1977 AZ                        6414
 6      1977 CA                       65596
 7      1977 CO                        7337
 8      1977 CT                        6051
 9      1977 CZ                           0
10      1977 DC                        4751
# … with 2,232 more rows

Remove territories

state_of_interest_NULL <- c("AS", "GM", "CZ", "FS", "MP", "OT", "PR", "VI")

ps_data <- ps_data |>
  filter(!(state_abbr %in% state_of_interest_NULL)) 
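The `!(x %in% set)` pattern used above, on a toy vector:

```r
territories <- c("AS", "GM", "CZ", "FS", "MP", "OT", "PR", "VI")
abbrs <- c("AK", "PR", "AL", "VI")

# %in% is TRUE where a value appears in the set; negate to keep everything else
abbrs[!(abbrs %in% territories)]
#> [1] "AK" "AL"
```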

Use state abbreviations

state_abb_data <- tibble("state_abbr" = state.abb, "STATE" = state.name)
state_abb_data <- state_abb_data |>
  mutate(state_abbr = str_replace(string = state_abbr, 
                                  pattern = "NE", 
                                  replacement = "NB")) |>
  add_row(state_abbr = "DC", STATE = "District of Columbia")

ps_data <- ps_data |> 
  left_join(state_abb_data, by = "state_abbr") |>
  select(-state_abbr) |> 
  rename(YEAR = "data_year",
         VALUE = "officer_state_total") |>
  mutate(VARIABLE = "officer_state_total")

ps_data
# A tibble: 1,938 × 4
    YEAR VALUE STATE                VARIABLE           
   <dbl> <dbl> <chr>                <chr>              
 1  1977   544 Alaska               officer_state_total
 2  1977  7380 Alabama              officer_state_total
 3  1977  3344 Arkansas             officer_state_total
 4  1977  6414 Arizona              officer_state_total
 5  1977 65596 California           officer_state_total
 6  1977  7337 Colorado             officer_state_total
 7  1977  6051 Connecticut          officer_state_total
 8  1977  4751 District of Columbia officer_state_total
 9  1977  1018 Delaware             officer_state_total
10  1977 24588 Florida              officer_state_total
# … with 1,928 more rows
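A toy version of the abbreviation-to-name `left_join()` above (made-up mini tables):

```r
library(dplyr)

lookup <- tibble(state_abbr = c("AK", "AL"), STATE = c("Alaska", "Alabama"))
counts <- tibble(state_abbr = c("AL", "AK"), officers = c(7380, 544))

# every row of `counts` is kept; matching rows of `lookup` are attached by key
left_join(counts, lookup, by = "state_abbr")
```

Each row of `counts` gains its full state name, matched on `state_abbr`.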

Scaling

denominator_temp <- population_data |> 
  select(-VARIABLE) |>
  rename("Population_temp"=VALUE) 

ps_data <- ps_data |> 
  left_join(denominator_temp, by=c("STATE","YEAR")) |>
  mutate(VALUE = (VALUE * 100000) / Population_temp) |>
  mutate(VARIABLE = "police_per_100k_lag") |>
  select(-Population_temp)

head(ps_data)
# A tibble: 6 × 4
   YEAR VALUE STATE      VARIABLE           
  <dbl> <dbl> <chr>      <chr>              
1  1977  137. Alaska     police_per_100k_lag
2  1977  195. Alabama    police_per_100k_lag
3  1977  152. Arkansas   police_per_100k_lag
4  1977  264. Arizona    police_per_100k_lag
5  1977  293. California police_per_100k_lag
6  1977  272. Colorado   police_per_100k_lag
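The scaling formula, with toy numbers (not the real Alaska values):

```r
officers  <- 544
residents <- 400000

# rate per 100,000 residents
(officers * 100000) / residents
#> [1] 136
```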

Wrangling: Poverty Rate

# A tibble: 5 × 6
  `NOTE: Number in thousands.` ...2  ...3   ...4              ...5         ...6 
  <chr>                        <chr> <chr>  <chr>             <chr>        <chr>
1 2018                         <NA>  <NA>    <NA>             <NA>          <NA>
2 STATE                        Total Number "Standard\nerror" Percent      "Sta…
3 Alabama                      4877  779    "65"              16           "1.3"
4 Alaska                       720   94     "9"               13.1         "1.2"
5 Arizona                      7241  929    "80"              12.80000000… "1.1…
colnames(poverty_rate_data) <- c("STATE", "Total", "Number", "Number_se",
                                 "Percent", "Percent_se")

poverty_rate_data <- poverty_rate_data |>
  filter(STATE != "STATE") |> 
  mutate(length_state = map_dbl(STATE, str_length)) |> # determine how long the string in the "STATE" column is
  filter(length_state < 100) |> # filter to only include possible state lengths
  mutate(STATE = str_replace(STATE, pattern = "D.C.", 
                              replacement = "District of Columbia" )) 

year_values <- poverty_rate_data |>
  filter(str_detect(STATE, "[:digit:]")) |>
  distinct(STATE)
year_values <- rep(pull(year_values, STATE), each = 52) # repeat values from STATE column 52 times each
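`rep(..., each = )` repeats each element in place, so one year label lines up with all 52 rows of that year's table. On a toy vector:

```r
rep(c("2018", "2017"), each = 3)
#> [1] "2018" "2018" "2018" "2017" "2017" "2017"
```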

poverty_rate_data <- poverty_rate_data |>
  mutate(year_value = year_values) |>
  select(-length_state) |>
  filter(str_detect(STATE, "[:alpha:]"))

poverty_rate_data <- poverty_rate_data |>
  filter(year_value != "2017") |> 
  filter(year_value != "2013 (18)") |>
  mutate(YEAR = str_sub(year_value, start = 1, end = 4)) |>
  select(-c(Number, Number_se, Percent_se, Total, year_value)) |>
  rename("VALUE" = Percent) |>
  mutate(VARIABLE = "Poverty_rate",
         YEAR = as.numeric(YEAR),
         VALUE = as.numeric(VALUE))

poverty_rate_data
# A tibble: 1,989 × 4
   STATE                VALUE  YEAR VARIABLE    
   <chr>                <dbl> <dbl> <chr>       
 1 Alabama               16    2018 Poverty_rate
 2 Alaska                13.1  2018 Poverty_rate
 3 Arizona               12.8  2018 Poverty_rate
 4 Arkansas              15.9  2018 Poverty_rate
 5 California            11.9  2018 Poverty_rate
 6 Colorado               9.1  2018 Poverty_rate
 7 Connecticut           10.2  2018 Poverty_rate
 8 Delaware               7.4  2018 Poverty_rate
 9 District of Columbia  14.7  2018 Poverty_rate
10 Florida               13.7  2018 Poverty_rate
# … with 1,979 more rows

Wrangling: Crime Data

# remove the footnote line and everything after it
footnote_start <- str_which(crime_data, "The figures shown in this column for the offense of rape were estimated using the legacy UCR definition of rape")
crime_data <- crime_data[-(footnote_start:length(crime_data))]

n_rows <- 2014 - 1977 + 1 # how many rows of data there are for each state
rep_cycle <- 4 + n_rows
rep_cycle_cut <- 2 + n_rows
colnames_crime <- crime_data[4]

# specify which rows are to be deleted based on the file format
delete_rows <- c(seq(from = 2, 
                       to = length(crime_data),  
                       by = rep_cycle),
                 seq(from = 3, 
                       to = length(crime_data),
                       by = rep_cycle), 
                 seq(from = 4,
                       to = length(crime_data),
                       by = rep_cycle))
sort(delete_rows) # which rows are to be deleted
  [1]    2    3    4   44   45   46   86   87   88  128  129  130  170  171  172
 [16]  212  213  214  254  255  256  296  297  298  338  339  340  380  381  382
 [31]  422  423  424  464  465  466  506  507  508  548  549  550  590  591  592
 [46]  632  633  634  674  675  676  716  717  718  758  759  760  800  801  802
 [61]  842  843  844  884  885  886  926  927  928  968  969  970 1010 1011 1012
 [76] 1052 1053 1054 1094 1095 1096 1136 1137 1138 1178 1179 1180 1220 1221 1222
 [91] 1262 1263 1264 1304 1305 1306 1346 1347 1348 1388 1389 1390 1430 1431 1432
[106] 1472 1473 1474 1514 1515 1516 1556 1557 1558 1598 1599 1600 1640 1641 1642
[121] 1682 1683 1684 1724 1725 1726 1766 1767 1768 1808 1809 1810 1850 1851 1852
[136] 1892 1893 1894 1934 1935 1936 1976 1977 1978 2018 2019 2020 2060 2061 2062
[151] 2102 2103 2104
# convince yourself you did it right
# should these rows be deleted?
crime_data[44:46]
[1] ",,National or state crime,,,,,,,"                                                                                                   
[2] ",,Violent crime,,,,,,,"                                                                                                             
[3] "Year,Population,Violent crime total,Murder and nonnegligent Manslaughter,Legacy rape /1,Revised rape /2,Robbery,Aggravated assault,"
crime_data <- crime_data[-delete_rows]
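Negative indexing with a vector, as in `crime_data[-delete_rows]`, drops all the listed positions at once:

```r
x <- c("keep", "drop1", "drop2", "keep too")
x[-c(2, 3)]  # keeps elements 1 and 4
```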

# extract state labels from data
state_labels <- crime_data[str_which(crime_data, "Estimated crime in ")]
state_labels <- str_remove(state_labels, pattern = "Estimated crime in ")
state_label_order <- rep(state_labels, each = n_rows) # repeat n_rows times

crime_data <- crime_data[-str_which(crime_data, "Estimated crime")]
crime_data_sep <- read_csv(I(crime_data), col_names = FALSE) |> 
  select(-X6) # remove random extra-comma column
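`read_csv(I(...))` parses a character vector as literal CSV lines rather than treating it as a file path; a minimal sketch:

```r
library(readr)

lines <- c("1977,15293", "1978,15682")

# each element of the vector becomes one row of the parsed tibble
read_csv(I(lines), col_names = c("Year", "Count"), show_col_types = FALSE)
```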

# set informative column names
colnames(crime_data_sep) <- c("Year", 
                              "Population", 
                              "Violent_crime_total",
                              "Murder_and_nonnegligent_Manslaughter",
                              "Legacy_rape",
                              "Revised_rape", 
                              "Robbery",
                              "Aggravated_assault")
# add the state labels in as a column
crime_data_sep <- bind_cols(STATE = state_label_order, crime_data_sep)

crime_data <- crime_data_sep |>
  mutate(VARIABLE = "Viol_crime_count") |>
  rename("VALUE" = Violent_crime_total) |>
  rename("YEAR" = Year) |>
  select(YEAR,STATE, VARIABLE, VALUE)

crime_data
# A tibble: 1,938 × 4
    YEAR STATE   VARIABLE         VALUE
   <dbl> <chr>   <chr>            <dbl>
 1  1977 Alabama Viol_crime_count 15293
 2  1978 Alabama Viol_crime_count 15682
 3  1979 Alabama Viol_crime_count 15578
 4  1980 Alabama Viol_crime_count 17320
 5  1981 Alabama Viol_crime_count 18423
 6  1982 Alabama Viol_crime_count 17653
 7  1983 Alabama Viol_crime_count 16471
 8  1984 Alabama Viol_crime_count 17204
 9  1985 Alabama Viol_crime_count 18398
10  1986 Alabama Viol_crime_count 22616
# … with 1,928 more rows

Wrangling: RTC Laws

DAWpaper_p_62 <- DAWpaper[[62]]
str(DAWpaper_p_62, nchar.max = 1000) # see data
 chr "                                          Table A1: RTC Adoption Dates\n\n         State         Effective Date of RTC Law   Fraction of Year In Effect Year of Passage   RTC Date (Synthetic Controls Analysis)\n      Alabama                      1975                                                                        1975\n       Alaska                   10/1/1994                            0.252                                     1995\n       Arizona                  7/17/1994                            0.460                                     1995\n      Arkansas                  7/27/1995                            0.433                                     1996\n     California                    N/A                                                                            0\n      Colorado                  5/17/2003                            0.627                                     2003\n    Connecticut                    1970                                "| __truncated__
p_62 <- DAWpaper_p_62 |>
  str_split("\n") |>
  unlist() |>
  as_tibble() |>
  slice(-(1:2)) |> 
  rename(RTC = value) |>
  slice(-c(53:54)) |>  # remove the physical page-number marking and empty line
  mutate(RTC = str_replace_all(RTC, "\\s{40,}", "|N/A|"),
         RTC = str_trim(RTC, side = "left"),
         RTC = str_replace_all(RTC, "\\s{2,15}", "|"))

head(p_62)
# A tibble: 6 × 1
  RTC                                                                           
  <chr>                                                                         
1 State|Effective Date of RTC Law|Fraction of Year In Effect Year of Passage|RT…
2 Alabama||1975|N/A|1975                                                        
3 Alaska||10/1/1994||0.252|||1995                                               
4 Arizona||7/17/1994||0.460|||1995                                              
5 Arkansas||7/27/1995||0.433|||1996                                             
6 California||N/A|N/A|0                                                         
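The width-based regex trick above can be seen on a toy line: runs of 40+ spaces mark empty cells, while shorter runs are ordinary column gaps (toy string, not the actual PDF text):

```r
library(stringr)

x <- paste0("Alabama", strrep(" ", 10), "1975", strrep(" ", 45), "1975")

x |>
  str_replace_all("\\s{40,}", "|N/A|") |>   # very wide gap -> empty cell
  str_replace_all("\\s{2,15}", "|")         # moderate gap -> delimiter
#> [1] "Alabama|1975|N/A|1975"
```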
p_62 <- pull(p_62, RTC) |>
  str_split("\\|{1,}")  # split each row on runs of one or more "|" symbols

# get the tibble!
p_62 <- as_tibble(do.call(rbind, p_62)) # rbind (not bind_rows) b/c the split pieces are unnamed character vectors

colnames(p_62) <- c("STATE",
                    "E_Date_RTC",
                    "Frac_Yr_Eff_Yr_Pass",
                    "RTC_Date_SA")

p_62 <- p_62 |>
  slice(-c(1, 53:nrow(p_62))) # remove unnecessary rows

RTC <- p_62 |> 
  select(STATE, RTC_Date_SA) |>
  rename(RTC_LAW_YEAR = RTC_Date_SA) |>
  mutate(RTC_LAW_YEAR = as.numeric(RTC_LAW_YEAR)) |>
  mutate(RTC_LAW_YEAR = case_when(RTC_LAW_YEAR == 0 ~ Inf,
                                  TRUE ~ RTC_LAW_YEAR))
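Recoding never-adopting states from `0` to `Inf` means the later comparison `YEAR >= RTC_LAW_YEAR` is `FALSE` for every year. On a toy vector:

```r
years <- c(1995, 0, 2003)

# 0 means "never adopted"; Inf keeps that meaning under numeric comparison
dplyr::case_when(years == 0 ~ Inf, TRUE ~ years)
#> [1] 1995  Inf 2003
```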

Wrangling: Combining! (Donohue)

# combine after all that wrangling!
DONOHUE_DF <- bind_rows(dem_DONOHUE,
                        ue_rate_data,
                        poverty_rate_data,
                        crime_data,
                        population_data,
                        ps_data)
DONOHUE_DF
# A tibble: 20,247 × 4
    YEAR STATE   VARIABLE                    VALUE
   <dbl> <chr>   <chr>                       <dbl>
 1  1977 Alabama Black_Male_15_to_19_years  1.55  
 2  1977 Alabama Black_Male_20_to_39_years  3.04  
 3  1977 Alabama Other_Male_15_to_19_years  0.0178
 4  1977 Alabama Other_Male_20_to_39_years  0.0642
 5  1977 Alabama White_Male_15_to_19_years  3.58  
 6  1977 Alabama White_Male_20_to_39_years 11.1   
 7  1977 Alaska  Black_Male_15_to_19_years  0.163 
 8  1977 Alaska  Black_Male_20_to_39_years  0.968 
 9  1977 Alaska  Other_Male_15_to_19_years  1.12  
10  1977 Alaska  Other_Male_20_to_39_years  2.73  
# … with 20,237 more rows
# to wide format!
DONOHUE_DF <- DONOHUE_DF |>
  pivot_wider(names_from = "VARIABLE",
              values_from = "VALUE")
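`pivot_wider()` turns each distinct `VARIABLE` into its own column; a toy sketch with made-up values:

```r
library(tidyr)
library(tibble)

long <- tribble(
  ~YEAR, ~STATE,    ~VARIABLE,      ~VALUE,
  1977,  "Alabama", "Population",   3.7e6,
  1977,  "Alabama", "Poverty_rate", 16
)

# one row per YEAR/STATE combination, one column per VARIABLE
pivot_wider(long, names_from = "VARIABLE", values_from = "VALUE")
```

The result has a single row with columns `YEAR`, `STATE`, `Population`, and `Poverty_rate`.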

# add in RTC data!
DONOHUE_DF <- DONOHUE_DF |>
  left_join(RTC, by = "STATE") |>
  mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
                             TRUE ~ FALSE)) |>
  drop_na() # drop rows with missing information

# keep only states where RTC laws were adopted after 1980 (or never adopted);
# this way, we have crime data from before adoption for every state
baseline_year <- min(DONOHUE_DF$YEAR)
censoring_year <- max(DONOHUE_DF$YEAR)

DONOHUE_DF <- DONOHUE_DF |>
  mutate(TIME_0 = baseline_year,
         TIME_INF = censoring_year) |>
  filter(RTC_LAW_YEAR > TIME_0)

# calculate violent crime rate; put population/crime on log scale
DONOHUE_DF <- DONOHUE_DF |>
  mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
         Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
         Population_log = log(Population))

DONOHUE_DF |>
  slice_sample(n = 10) |>
  glimpse()
Rows: 10
Columns: 20
$ YEAR                      <dbl> 2007, 1982, 1986, 2008, 2001, 1994, 1982, 19…
$ STATE                     <chr> "South Carolina", "Wisconsin", "Kentucky", "…
$ Black_Male_15_to_19_years <dbl> 1.3360605, 0.2259944, 0.3777841, 0.2314658, …
$ Black_Male_20_to_39_years <dbl> 3.7130944, 0.6522053, 1.1693897, 0.7158293, …
$ Other_Male_15_to_19_years <dbl> 0.12933973, 0.06599873, 0.02546219, 0.214153…
$ Other_Male_20_to_39_years <dbl> 0.4677877, 0.2201790, 0.1076788, 0.7418817, …
$ White_Male_15_to_19_years <dbl> 2.329578, 4.280084, 3.786301, 3.309214, 2.85…
$ White_Male_20_to_39_years <dbl> 9.151731, 15.368907, 15.100435, 12.062884, 1…
$ Unemployment_rate         <dbl> 5.7, 10.5, 9.4, 3.3, 3.7, 5.1, 15.4, 8.8, 5.…
$ Poverty_rate              <dbl> 14.1, 9.5, 17.7, 10.6, 8.9, 15.6, 16.2, 18.6…
$ Viol_crime_count          <dbl> 34630, 9077, 12467, 5570, 30585, 39240, 5980…
$ Population                <dbl> 4444110, 4728879, 3687821, 1796378, 6397634,…
$ police_per_100k_lag       <dbl> 373.3256, 251.3704, 189.7326, 279.9522, 326.…
$ RTC_LAW_YEAR              <dbl> 1997, 2012, 1997, 2007, Inf, 2004, 2001, 199…
$ RTC_LAW                   <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALS…
$ TIME_0                    <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1980, 19…
$ TIME_INF                  <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 20…
$ Viol_crime_rate_1k        <dbl> 7.792336, 1.919482, 3.380587, 3.100684, 4.78…
$ Viol_crime_rate_1k_log    <dbl> 2.0531407, 0.6520556, 1.2180494, 1.1316226, …
$ Population_log            <dbl> 15.30709, 15.36920, 15.12055, 14.40128, 15.6…

Wrangling: Combining! (Lott)

LOTT_DF <- bind_rows(dem_LOTT,
                     ue_rate_data,
                     poverty_rate_data,
                     crime_data,
                     population_data,
                     ps_data) |>
  pivot_wider(names_from = "VARIABLE",
              values_from = "VALUE") |>
  left_join(RTC, by = "STATE") |>
  mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
                             TRUE ~ FALSE)) |>
  drop_na()

baseline_year <- min(LOTT_DF$YEAR)
censoring_year <- max(LOTT_DF$YEAR)

LOTT_DF <- LOTT_DF |>
  mutate(TIME_0 = baseline_year,
         TIME_INF = censoring_year) |>
  filter(RTC_LAW_YEAR > TIME_0)

LOTT_DF <- LOTT_DF |>
  mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
         Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
         Population_log = log(Population))

LOTT_DF
# A tibble: 1,364 × 50
    YEAR STATE   Black…¹ Black…² Black…³ Black…⁴ Black…⁵ Black…⁶ Black…⁷ Black…⁸
   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1  1980 Alaska   0.264    0.443  0.201   0.116   0.0924 0.0264    0.297   0.695
 2  1980 Arizona  0.287    0.278  0.165   0.119   0.136  0.103     0.311   0.338
 3  1980 Arkans…  1.82     1.50   0.842   0.634   1.02   1.16      1.81    1.26 
 4  1980 Califo…  0.780    0.815  0.581   0.394   0.456  0.292     0.808   0.815
 5  1980 Colora…  0.352    0.388  0.245   0.172   0.164  0.103     0.377   0.467
 6  1980 Delawa…  1.87     1.68   1.14    0.783   0.952  0.670     1.81    1.36 
 7  1980 Distri…  6.53     7.54   5.18    3.89    6.10   4.15      6.32    6.40 
 8  1980 Florida  1.50     1.37   0.912   0.679   0.812  0.604     1.49    1.20 
 9  1980 Georgia  2.90     2.78   1.85    1.22    1.56   1.35      2.92    2.45 
10  1980 Hawaii   0.0930   0.215  0.0776  0.0253  0.0197 0.00738   0.180   0.656
# … with 1,354 more rows, 40 more variables: Black_Male_30_to_39_years <dbl>,
#   Black_Male_40_to_49_years <dbl>, Black_Male_50_to_64_years <dbl>,
#   Black_Male_65_years_and_over <dbl>, Other_Female_10_to_19_years <dbl>,
#   Other_Female_20_to_29_years <dbl>, Other_Female_30_to_39_years <dbl>,
#   Other_Female_40_to_49_years <dbl>, Other_Female_50_to_64_years <dbl>,
#   Other_Female_65_years_and_over <dbl>, Other_Male_10_to_19_years <dbl>,
#   Other_Male_20_to_29_years <dbl>, Other_Male_30_to_39_years <dbl>, …

Why?

❓ Why are there different dimensions for LOTT vs DONOHUE?

dim(LOTT_DF)
[1] 1364   50
dim(DONOHUE_DF)
[1] 1364   20

Things to Consider

  • How R Markdown documents work
  • How to control what is executed each time you render
  • What to do with all this after this lecture?

Save

save(LOTT_DF, DONOHUE_DF, file = "data/wrangled/wrangled_data_rtc.rda")